Word embedding has become the most popular method of lexical description 
in a given context in the natural language processing domain, especially 
through the word to vector (Word2Vec) and global vectors (GloVe) 
implementations. Since GloVe is a pre-trained model that provides access to 
word mapping vectors on many dimensionalities, a large number of 
applications rely on its prowess, especially in the field of sentiment analysis. 
However, in the literature, we found that in many cases, GloVe is 
implemented with arbitrary dimensionalities (often 300d) regardless of the 
length of the text to be analyzed. In this work, we conducted a study that 
identifies the effect of the dimensionality of word embedding mapping 
vectors on short and long texts in a sentiment analysis context. The results 
suggest that as the dimensionality of the vectors increases, the performance 
metrics of the model also increase for long texts. In contrast, for short texts, 
we recorded a threshold above which dimensionality no longer matters. 

1. INTRODUCTION 

Research fields related to natural language processing, e.g. information retrieval [1], [2], document 
classification [3], named entity recognition [4], machine translation [5], sentiment analysis [6], [7], 
recommendation systems [8] or audience segmentation [9], [10], have in common that they are problems of 
perception related to our senses. They have therefore always been a great challenge for researchers, 
because it is particularly difficult to describe a text using algorithms and mathematical formulas. As a result, 
the first models deployed in this field relied on expert knowledge, such as grammatical and syntactic rules. 
Several years of research have been devoted to exploiting and transforming this unstructured data in order 
to give it meaning. One of the most successful techniques is word embedding. 

The foundations of word embedding were laid by the linguistic theory of Zellig Harris, also known 
as distributional semantics [11], [12]. This theory states that a word is characterized by its context, formed by 
the words around it. Therefore, words that share similar contexts also have similar meanings. 

Word embedding is a numerical representation of text in which words with similar meanings 
also have similar representations. It consists of representing each word in the dictionary as a 
real-valued vector in a predefined vector space. These vectors are often generated using neural network-based 
models. As a result, the word embedding technique is often grouped into the deep learning domain. Indeed, 
the principle of using neural networks to model high-dimensional discrete distributions has already been 
applied to learning the joint probability of a set of random variables, each of which is likely different in 
nature [13]. Thus, the idea of word embedding is to use a dense distributed representation for each word, 
which results in vectors composed of dozens or hundreds of dimensions, contrasting with the thousands or 
millions of dimensions required for sparse word representations, such as one-hot encoding. Indeed, when 
applying one-hot encoding to words, we end up with high-dimensional sparse vectors containing a large 
number of zeros. On large datasets, this can lead to performance problems. Moreover, one-hot encoding does 
not take into account the semantics of words [14]. 
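
The contrast between the two representations can be illustrated with a minimal Python sketch. The five-word vocabulary and the 3-dimensional dense vectors below are invented for illustration and are not taken from any trained model.

```python
# Toy vocabulary; a real vocabulary would hold tens of thousands of words.
vocab = ["flight", "delayed", "great", "movie", "boring"]
word_to_index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a sparse vector with a single 1 at the word's index."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector


def dot(u, v):
    """Dot product of two equally sized lists."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical dense vectors, invented for illustration only.
dense = {
    "great": [0.8, 0.1, 0.3],
    "boring": [-0.7, 0.2, 0.3],
    "movie": [0.1, 0.9, 0.0],
}

# One-hot vectors grow with the vocabulary, and two distinct words are always orthogonal.
print(len(one_hot("great")), dot(one_hot("great"), one_hot("boring")))
# Dense vectors keep a small fixed size, and their dot products differ from pair to pair.
print(len(dense["great"]), dot(dense["great"], dense["boring"]))
```

The one-hot vector has as many entries as the vocabulary and carries no notion of proximity between words, whereas the dense vector has a fixed length that is independent of the vocabulary size.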

The word embedding approach seeks to associate each vocabulary word with a distributed word 
feature vector. The feature vector represents various aspects of the word that is associated with a point in a 
vector space. The number of features on which words are mapped is significantly smaller than the vocabulary 
size. In addition, the semantic relationships between words are reflected in the distance and direction of the 
vectors [15]. 
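
A common way to compare the direction of two word vectors is cosine similarity. The sketch below is an illustration under assumptions: the 4-dimensional vectors are hypothetical and are not real GloVe values.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical vectors chosen so that "good" and "great" point the same way.
vectors = {
    "good": [0.9, 0.1, 0.4, 0.0],
    "great": [0.8, 0.2, 0.5, 0.1],
    "runway": [0.0, 0.9, -0.3, 0.7],
}

print(cosine_similarity(vectors["good"], vectors["great"]))
print(cosine_similarity(vectors["good"], vectors["runway"]))
```

With these toy values, the first pair scores much higher than the second, which is the behavior expected of semantically close words in a trained embedding.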

The idea of identifying similarities between words in order to generalize from training sequences to new 
sequences dates back to the early 1990s. For example, it is used in approaches based on learning a clustering 
of words: each word is deterministically or probabilistically associated with a class, and words of the same class 
share a certain similarity [16], [17]. 

In our study, we sought to identify whether a correlation exists between the number of dimensions 
of a word embedding vector and the performance of a sentiment analysis model according to the size of the 
text to be analyzed. We used a gated recurrent unit (GRU) recurrent neural network, whose input was 
the word embedding representation produced by the global vectors (GloVe) model with 
dimensionalities of 50, 100, 200 and 300, respectively. We computed performance metrics, including accuracy 
and F1 score, on short texts from the Twitter Airlines Sentiment dataset and relatively long texts from the 
Internet Movie Database dataset. The results of this study show that the dimensionality of the word embedding 
vectors has a positive impact on the performance metrics for long texts, while it no longer 
matters for short texts above a certain threshold. 


2. LITERATURE REVIEW 
2.1. Word embedding 

Among the goals of statistical language modeling is learning the joint probability function of word 
sequences in a language. This task is intrinsically difficult because of the high dimensionality: a 
word sequence on which the model will be evaluated is most likely to be different from all word sequences 
seen in the training phase. Traditional n-gram based approaches succeed in generalizing by clustering very short 
overlapping sequences in the training set. However, the resulting models contain millions of parameters and 
thus learning them in a reasonable time is a complex task [15]. From a historical point of view, the encoding 
of words according to certain characteristics of their meaning began in the 1950s and 1960s [15]. In 
particular, the vector model in information retrieval makes it possible to represent a complete document by a 
mathematical object that aims to capture its meaning. 

There are two approaches to encoding the context of a word: frequency-based approaches, which count 
the words co-occurring with a given word in order to create dense vectors of small dimensions [18], and 
lexical embeddings, which seek to predict a given word using its context or vice versa. This is the case, for 
example, of the word to vector (Word2Vec) algorithm [19]. This last approach relies on artificial neural 
networks to build these vectors. These models are trained on very large corpora (up to 1.6 billion words per 
day) in order to predict a word from its context or vice versa. The Word2Vec model has two different 
architectures for creating word embeddings: the continuous bag-of-words (CBOW) model, which attempts to 
predict a word from its neighboring words, and the skip-gram model, which attempts to predict context words 
from a central word [19]. It has been shown that distributed representations based on neural networks 
significantly outperform n-gram models [20]-[22]. Furthermore, since it is difficult to determine the exact 
number of meanings for each word, as the meaning of the word depends on the context, models such as 
adaptive probabilistic word embedding (APWE), where the polysemy of the words is defined on a latent 
interpretable semantic space [23] or word sense disambiguation [24] have been proposed. 
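
The difference between the two Word2Vec architectures described above lies in how training examples are formed. The following sketch only builds the (input, target) pairs and does not train a network; the function names and the window size are assumptions made for illustration.

```python
def context_window(tokens, position, window):
    """Return the tokens within `window` positions of `position`, excluding it."""
    start = max(0, position - window)
    left = tokens[start:position]
    right = tokens[position + 1:position + 1 + window]
    return left + right


def cbow_examples(tokens, window=1):
    """(context words -> target word) pairs, as used by CBOW."""
    return [(context_window(tokens, i, window), tokens[i]) for i in range(len(tokens))]


def skipgram_examples(tokens, window=1):
    """(central word -> one context word) pairs, as used by skip-gram."""
    return [
        (tokens[i], context)
        for i in range(len(tokens))
        for context in context_window(tokens, i, window)
    ]


sentence = ["the", "flight", "was", "late"]
print(cbow_examples(sentence))
print(skipgram_examples(sentence))
```

CBOW yields one example per position with several inputs, while skip-gram yields one example per (central word, neighbor) pair.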


2.2. GloVe 

Recent methods for learning vector space representations of words have succeeded 
in capturing fine-grained semantic and syntactic regularities using vector arithmetic. Nevertheless, the 
origin of these regularities has remained unclear. In order to bring out these regularities in word vectors, 
researchers at Stanford University combined the advantages of the two major families of models in the 
literature, namely global matrix factorization and local context window methods. The result is a pre-trained 
model named GloVe [25]. 

The approach used by the GloVe method for word embedding differs from that of Word2Vec. It is an 
unsupervised learning algorithm that computes vector representations for words. The model is trained on 
aggregate word-word co-occurrence statistics of a given corpus. The resulting representations exhibit 
interesting linear substructures of the word vector space. Unlike Word2Vec, GloVe does not rely only 
on local statistics (information about the local context of words), but also integrates global statistics (word 
co-occurrence) to generate word vectors [25]. 
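
To make the notion of word-word co-occurrence statistics concrete, the sketch below counts co-occurrences within a fixed window over a toy corpus. It is a simplified illustration: the counts are unweighted, the window size is an assumption, and the subsequent fitting of vectors to these counts is not reproduced.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered (word, context word) pair occurs within `window`."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            for neighbour in neighbours:
                counts[(word, neighbour)] += 1
    return counts


# Toy corpus invented for illustration.
corpus = [
    ["the", "flight", "was", "late"],
    ["the", "movie", "was", "great"],
]
counts = cooccurrence_counts(corpus, window=2)
print(counts[("the", "was")])
print(counts[("flight", "movie")])
```

Because the counts are aggregated over the whole corpus before any vector is learned, they capture global statistics rather than the context of a single sentence.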

The 50-, 100-, 200-, and 300-dimensional GloVe word vectors were trained on a Wikipedia dump 
and the Gigaword 5 corpus. They encode 400,000 tokens as single vectors, and all tokens outside the 
vocabulary were encoded as a vector of zeros. The richness and robustness of GloVe vectors have placed them 
at the heart of many works related to natural language processing, as in [26], where the authors introduced an 
innovative MapReduce-enhanced decision tree classification approach. They used several feature extractors, 
including the GloVe model, to efficiently detect and capture relevant data from given tweets. Alexandridis et al. 
[27] used various language models to represent social media texts and Greek language text classifiers, using 
word embedding implemented by the GloVe model, to detect the polarity of opinions expressed on social 
media. The GloVe model has also been used in sentiment analysis models, often associated with a recurrent 
neural network module such as long short-term memory (LSTM) or GRU [6], [28], [29]. 


3. ARCHITECTURE AND METHODOLOGY 

The objective of our study is to evaluate the effect of the dimensionality of the word embedding 
vectors, implemented by the GloVe model, on the performance metrics related to sentiment analysis of 
short and long texts. The training and test data come from the Twitter US Airlines Sentiments [30] and 
Internet Movie Database (IMDb) [31] datasets, with an overall average text length of 17 and 231 words 
respectively. For each dataset, we used the GloVe model with 50, 100, 200 and 300 dimensions. 
The sentiment analysis is performed using a GRU recurrent neural network. At the output, we retrieve the 
binary positive or negative score of the sentiment expressed in the input instances, as shown in Figure 1. 


[Block diagram. Legible labels: Sentence, GRU layer, Binary sentiment] 


Figure 1. Proposed GRU-based sentiment analysis model for the study 


3.1. Preprocessing 

After cleaning and filtering the data from the two datasets, we proceeded to tokenization, which 
consists of dividing the text into single tokens or into combinations of several successive tokens of the 
same length. This operation also allows us to map the vocabulary of all the words of the dataset into a 
dictionary in order to train our model. We selected 10,000 and 2,000 tweets for the training set and 
the test set respectively to represent the short texts, and 5,000 and 1,000 IMDb comments respectively for 
the long texts. In this study, we used word embedding by implementing the GloVe model, which uses vectors 
of single words. Therefore, we segmented our sentences into single-word tokens. For each dataset used in 
this study, the entries do not have the same length. However, for our GRU-based model to work properly, we 
defined a common sequence length, which corresponds to the number of time steps of the GRU layer. 
This length is the maximum length found among the training texts (36 tokens in the case of Twitter US Airlines 
Sentiments and 2,470 tokens in the case of IMDb). 
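
The preprocessing steps above can be sketched as follows. This is a minimal illustration under assumptions: whitespace tokenization, lowercasing, right-padding with a reserved index 0, and the helper names are all choices made for this sketch rather than details reported in the study.

```python
PAD_INDEX = 0  # index reserved for padding (an assumption of this sketch)


def build_vocabulary(texts):
    """Map every distinct lowercase token to an integer index starting at 1."""
    vocabulary = {}
    for text in texts:
        for token in text.lower().split():
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary) + 1
    return vocabulary


def encode_and_pad(texts, vocabulary, max_length):
    """Turn texts into index sequences, truncated or right-padded to max_length.

    Tokens missing from the vocabulary are dropped in this sketch.
    """
    sequences = []
    for text in texts:
        indices = [vocabulary[t] for t in text.lower().split() if t in vocabulary]
        indices = indices[:max_length]
        sequences.append(indices + [PAD_INDEX] * (max_length - len(indices)))
    return sequences


# Toy training texts invented for illustration.
train_texts = ["Great flight", "My bag was lost again"]
vocabulary = build_vocabulary(train_texts)
max_length = max(len(text.split()) for text in train_texts)
print(encode_and_pad(train_texts, vocabulary, max_length))
```

Every sequence then has the length of the longest training text, which plays the role of the 36 and 2,470 time steps mentioned above.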


3.2. Word embedding using GloVe 

Word embedding improves text classification by addressing both the sparse matrix problem (such as that 
resulting from the coding scheme of the bag-of-words model) and the representation of word semantics. In 
our study, we implemented word embedding using the GloVe model. Learning is performed on global word-word 
co-occurrence statistics aggregated from a corpus. The resulting representations thus show linear substructures of 
the word vector space [25]. 

In order to identify the effect of the dimensionality of word embedding vectors, we implemented the 
GloVe model with dimensions 50, 100, 200, and 300 for both short and long texts. The dimension used 
directly sets the "Output embedding dim." hyperparameter of the model used for sentiment analysis, 
as shown in Table 1. 
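
As an illustration of how pre-trained vectors of a given dimensionality can feed an embedding layer, the sketch below builds an embedding matrix from lines in the plain-text GloVe layout (a word followed by its space-separated values). The two sample lines are made up, the helper names are hypothetical, and out-of-vocabulary words are left as zero vectors, as described in section 2.2.

```python
def parse_glove_lines(lines):
    """Parse 'word v1 v2 ... vd' lines into a {word: vector} dictionary."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


def build_embedding_matrix(vocabulary, vectors, dimension):
    """One row per vocabulary index; row 0 (padding) and unknown words stay at zero."""
    matrix = [[0.0] * dimension for _ in range(len(vocabulary) + 1)]
    for word, index in vocabulary.items():
        if word in vectors:
            matrix[index] = vectors[word]
    return matrix


# Two made-up 3-dimensional lines standing in for a real 50d to 300d GloVe file.
sample_lines = ["flight 0.1 -0.2 0.3", "late 0.5 0.0 -0.4"]
vectors = parse_glove_lines(sample_lines)
vocabulary = {"flight": 1, "late": 2, "unseenword": 3}
matrix = build_embedding_matrix(vocabulary, vectors, dimension=3)
print(matrix)
```

Changing the dimensionality only changes the width of this matrix; the number of rows is fixed by the vocabulary.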

3.3. GRU layer 

GRU, introduced by Cho et al. [32], is a class of recurrent neural networks that aims to 
solve the vanishing gradient problem that naturally accompanies traditional recurrent neural networks. This 
problem arises when backpropagating through the recurrent neural network during training, especially for 
networks with deeper layers [33], [34]. GRU is similar to an LSTM network 
with a forget gate [35], but without the output gate that characterizes LSTMs [33]. The performance of 
GRU on some tasks, in this case natural language processing, is similar to that of LSTMs [36], [37]. In 
addition, the implementation of GRU is simpler and less expensive in terms of computing power. For our 
study, we implemented the GRU layer with the hyperparameters shown in Table 1, in order to handle short 
texts (Twitter Airlines Sentiment) and long texts (IMDb). 
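
For readers unfamiliar with the gating mechanism, a single GRU time step can be sketched in plain Python. This follows a common textbook formulation with an update gate and a reset gate; the parameter names are assumptions, the gating convention is stated in the code, and library implementations may differ in details such as bias handling.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, x, U, h, b):
    """Compute W @ x + U @ h + b with plain Python lists."""
    return [
        sum(w * xi for w, xi in zip(W[j], x))
        + sum(u * hi for u, hi in zip(U[j], h))
        + b[j]
        for j in range(len(b))
    ]


def gru_step(x, h, p):
    """One GRU time step: returns the new hidden state."""
    z = [sigmoid(a) for a in affine(p["Wz"], x, p["Uz"], h, p["bz"])]  # update gate
    r = [sigmoid(a) for a in affine(p["Wr"], x, p["Ur"], h, p["br"])]  # reset gate
    gated_h = [ri * hi for ri, hi in zip(r, h)]
    candidate = [math.tanh(a) for a in affine(p["Wh"], x, p["Uh"], gated_h, p["bh"])]
    # Convention of this sketch: z close to 1 keeps the previous state.
    return [zi * hi + (1.0 - zi) * ci for zi, hi, ci in zip(z, h, candidate)]


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


# Tiny sizes and all-zero weights, chosen only to make the step easy to follow.
input_dim, hidden_dim = 3, 2
params = {name: zeros(hidden_dim, input_dim) for name in ("Wz", "Wr", "Wh")}
params.update({name: zeros(hidden_dim, hidden_dim) for name in ("Uz", "Ur", "Uh")})
params.update({name: [0.0] * hidden_dim for name in ("bz", "br", "bh")})

print(gru_step([1.0, 2.0, 3.0], [1.0, -2.0], params))
```

In the model studied here, the input vector `x` at each time step would be one GloVe word vector, so its length equals the chosen embedding dimensionality.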


Table 1. Hyperparameters applied to the GRU model deployed for sentiment analysis 
Hyperparameter             Short text (Twitter)    Long text (IMDb) 
Input vocabulary size      36                      2,470 
Output embedding dim.      50, 100, 200, 300       50, 100, 200, 300 
GRU layer internal units   256                     256 
Optimizer                  Adam                    Adam 
Loss                       Categ. Crossentropy     Categ. Crossentropy 
Activation function        Softmax                 Softmax 


3.4. Softmax layer 

Softmax constitutes a generalization of logistic regression. It can be used in the case of multi-class 
classification. The softmax function transforms a vector of K real values into a vector of the same dimension 
whose values add up to 1, as given in (1). Thus, the softmax transforms the input values into values between 
0 and 1. The latter can thus be interpreted as a normalized probability distribution. 


$$\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} \quad \text{for } j \in \{1, \dots, K\} \qquad (1)$$


In practice, most multilayer neural networks end with a softmax layer that produces scaled 
real-valued scores that are easier to manipulate in further processing [6]. In our work, we used a 
softmax layer that delivers two probability scores, which represent the positivity and negativity of the 
model's input text. The higher probability determines the overall binary sentiment expressed in the input 
sentence. 
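
The softmax function and the selection of the most probable class can be sketched in a few lines of Python. The two input scores and the order of the labels are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Map K real scores to K probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["negative", "positive"]  # label order is an assumption of this sketch
probabilities = softmax([0.3, 1.7])  # made-up scores from the last layer
predicted = labels[probabilities.index(max(probabilities))]
print(probabilities, predicted)
```

Subtracting the maximum score before exponentiation does not change the result, since the common factor cancels in the ratio, but it keeps the computation numerically stable.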


4. RESULTS 
4.1. Evaluation of the dimensionality of word embedding on short texts 

We fed our model, shown in Figure 1, with texts from the Twitter US 
Airlines Sentiments dataset, which have an average length of 17 words. The longest tweet is 36 words. In the 
word embedding layer, we applied the GloVe model with vectors of 50, 100, 200 and 300 dimensions 
respectively. 

The accuracy of the model is 0.904 when the word embedding is 
implemented with the GloVe model whose words are mapped on 50 dimensions. This accuracy is 0.943, 0.944 
and 0.946 if we increase the dimensionality of the word embedding to 100, 200 and 300 respectively, as 
shown in Table 2. As for the F1 score (4), which summarizes the values of precision (2) and recall (3) as a 
harmonic mean, it is 0.721, 0.754, 0.747 and 0.773 for the respective dimensionalities of 50, 100, 200 
and 300. 


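$$\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \qquad (2)$$

$$\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \qquad (3)$$

$$\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (4)$$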
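
As a worked illustration of definitions (2) to (4), the following Python sketch computes the three metrics from confusion-matrix counts. The counts are made up and are not taken from the experiments; the function assumes at least one predicted positive and one actual positive, so that no denominator is zero.

```python
def precision_recall_f1(true_positive, false_positive, false_negative):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = true_positive / (true_positive + false_positive)
    recall = true_positive / (true_positive + false_negative)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts, for illustration only.
precision, recall, f1 = precision_recall_f1(
    true_positive=80, false_positive=20, false_negative=40
)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Because F1 is a harmonic mean, it stays closer to the lower of the two values, so a high precision cannot compensate for a low recall.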


Both performance metrics increase from dimensionality 50 to 100, as shown 
in Figure 2. However, accuracy remains almost constant from dimensionality 100 onwards. As for the F1 
score, it climbs slightly beyond this threshold (after a slight drop at dimensionality 200). As 
for precision, its maximum value (0.805) was recorded for dimensionality 50. 


Table 2. Performance metrics for short texts according to the different dimensionalities of the GloVe model 
Dimensionality   Accuracy   Precision   Recall   F1 score 
50d              0.904      0.805       0.652    0.721 
100d             0.943      0.744       0.764    0.754 
200d             0.944      0.672       0.840    0.747 
300d             0.946      0.789       0.758    0.773 


4.2. Evaluation of the dimensionality of word embedding on long texts 

Concerning the long texts, which come from the IMDb dataset, the average length of the comments 
is 231 words. The longest comment is 2,470 words. As with the short texts, we applied the same dimensions in 
the word embedding layer and kept the same hyperparameters, as shown in Table 1. The performance metrics 
for this dataset are reported in Table 3. 

All performance metrics improve overall as the dimensionality of the word embedding 
increases (only precision is unchanged between 50d and 100d), as shown in Figure 3. Indeed, the accuracy 
increases from 0.686 for dimensionality 50 to 0.854 for dimensionality 300. The F1 score increases from 
0.711 to 0.854. Precision and recall also increase significantly, from 0.830 and 0.622 to 0.918 and 0.800 
respectively. It therefore seems reasonable to extrapolate from the observed results that the performance 
metrics for sentiment analysis in long texts will continue to improve as the dimensionality of the vectors 
used for word embedding increases, at least up to a certain threshold that is greater than or equal to 300d. 


Table 3. Performance metrics for long texts according to the different dimensionalities of the GloVe model 
Dimensionality   Accuracy   Precision   Recall   F1 score 
50d              0.686      0.830       0.622    0.711 
100d             0.784      0.830       0.739    0.782 
200d             0.825      0.891       0.770    0.826 
300d             0.854      0.918       0.800    0.854 

Figure 2. Graphical representation of the evolution of performance metrics for short texts 

Figure 3. Graphical representation of the evolution of performance metrics for long texts 


5. DISCUSSION 

The results of our study indicate that for long texts, such as IMDb comments, the 
performance metrics improve as the dimensionality of the word embedding increases. On the other hand, for 
short texts such as tweets, we found that the performance metrics, in this case accuracy and the F1 score, which 
combines precision and recall, increase up to the 100d dimensionality threshold and then stabilize. Indeed, 
even if the dimensionality increases after reaching the 100d threshold, the model is almost 
insensitive to it. This behavior suggests that there are probably dimensions in the word mapping vector 
whose utility is minimal. We attribute this to the existence of dimensions that likely have anomalies in the 
GloVe model due to some parameters not being set to optimized global values [38]. The effect of these 
anomalies is revealed in the case of short texts like tweets. This is likely due to the difficulty encountered 
when disambiguating words in such texts [39]. 

Therefore, it would be wise to adopt the minimum dimensionality that ensures the best performance 
in order to optimize the use of computational resources. Indeed, word embedding with a small dimensionality 
is generally not expressive enough to capture all possible word relations, while a very large dimensionality is 
prone to over-fitting. In addition, the number of parameters for a word embedding or a model that relies on 
word embeddings, such as recurrent neural networks, is usually a linear or quadratic function of 
dimensionality, which affects training time and computational costs [40]. Our study showed that this 
dimensionality is 100d for short texts, such as tweets or comments related to blog posts, and 300d for long 
texts, such as IMDb comments or reviews left on Airbnb [41]. Mapping each word of the corpus onto 
an unnecessarily large number of dimensions, all the more so if the corpus is large, could increase the 
complexity of the model, slow down training and add inference latency, which can make it 
impractical to deploy the model on real tasks. 
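
The dependence of model size on dimensionality can be made concrete with a short calculation. The sketch below assumes a textbook GRU with three weight blocks and one bias vector per block; some library implementations add a second bias vector, so actual counts may differ slightly. The vocabulary of 400,000 tokens and the 256 internal units are taken from the values reported above.

```python
def embedding_parameters(vocabulary_size, dimension):
    """One vector of `dimension` weights per vocabulary entry."""
    return vocabulary_size * dimension


def gru_parameters(input_dim, hidden_units):
    """Three weight blocks (update gate, reset gate, candidate), one bias vector each."""
    return 3 * (hidden_units * (input_dim + hidden_units) + hidden_units)


for dimension in (50, 100, 200, 300):
    print(
        dimension,
        embedding_parameters(400_000, dimension),
        gru_parameters(dimension, 256),
    )
```

Under these assumptions, both counts grow linearly with the embedding dimensionality, and the embedding table dominates the total by a wide margin.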


6. CONCLUSION 

The content generated by users of microblogs, such as social networks or opinion sharing sites, is a 
rich and abundant source of opinions and information. If carefully studied, it offers great potential for 
extracting useful and precious knowledge, in this case in terms of sentiment analysis, which aims to identify 
the opinion and subjectivity of people's feedback from unstructured text written in natural language. The 
machine learning models involved in performing sentiment analysis very often require mapping the input text 
into vectors that contain real values. This statistical modeling of language involves learning the joint 
probability function of sequences of words in a language, a task marked by high dimensionality. There 
are solutions, such as n-grams, which make it possible to generalize over overlapping word 
sequences, but they result in models that contain an excessively large number of parameters, which makes it 
impossible to train them in a reasonable time. The solution to such a problem lies in word embedding, 
a vector model that makes it possible to represent a complete document by a 
mathematical object that aims to capture its meaning. Although the Word2Vec model is a reference in 
terms of word embedding, the GloVe model, an unsupervised learning algorithm for obtaining 
vector representations of words, also seems to be very popular in some natural language processing-related 
domains such as sentiment analysis. GloVe makes it possible to map the words of the dictionary onto vectors of 
several dimensionalities, the most frequent being 50d, 100d, 200d and 300d. In our study, we investigated whether the 
dimensionality of the vector implementing the GloVe model can have an impact on performance metrics in 
relation to sentiment analysis for short and long texts. We therefore integrated GloVe into a sentiment 
analysis model based on GRU recurrent neural networks. Then, we trained it on corpora coming from the 
Twitter US Airlines Sentiments dataset which contains short texts and the IMDb dataset which contains 
relatively long texts. Each time, we applied a word mapping through vectors that implement the word 
embedding using the GloVe model with a different dimensionality, in this case 50d, 100d, 200d and 300d. 
The results suggest that for short texts, the performance metrics (i.e., accuracy and F1 score) increase up to 
the 100d threshold and then stabilize. Thus, the use of word embedding through higher dimensionality 
vectors has almost no impact on the performance of our sentiment analysis model. On the other hand, for 
long texts, we found that performance metrics increase as we use word embedding with higher 
dimensionality vectors. Therefore, in order to optimize computational resources, we suggest using 
100-dimensional word mapping through the GloVe model for short texts. On the other hand, it is recommended to 
use a word mapping with high dimensionality for long texts, within a limit that strikes a 
compromise between the resources and the computational time on the one hand and the targeted performance 
metrics on the other hand. 